extractKinja - Another backup solution [UPDATE : Initial batch import support]

Kinja'd!!! "Jb boin" (jb-boin)
11/13/2020 at 03:07 • Filed to: extractKinja, kinjapocalypse, kinjapocalypse 2020, PHP, archive

Kinja'd!!!5 Kinja'd!!! 19

[UPDATE on the 13th] It is now possible to batch import the articles listed on one page, such as the main page of a blog or the “Posts” page of an author.

I made a tool to extract articles from Kinja blogs and keep only the content part of the article (no header/footer/comments/“You might also like”) while saving/replacing any external content that would be fetched from Kinja.

No JavaScript from Kinja has been reused; I wrote only minimal code to display the tags menu and to open images in a new tab when clicking on them. The YouTube, Twitter, Vimeo, Dailymotion, Imgur and Instagram widgets/embeds from Kinja have been replaced by “standard” ones.

It saves the article(s) on the web server along with a copy of the images and videos (including the author avatar, blog favicon and thumbnail used on the main page); for the images, only the highest resolution is kept.

It is still in beta (please let me know if you find any issues or have ideas). The next step is to do better batch import and, if possible, recreate the equivalent of the main page: a list of posts with the photo, title and name of the author.

How to use - To backup one article

Copy/paste either the ID of the article (e.g. 1845644279 for this page) or the full URL (including https://) of the article you want archived at the end of this URL: http://jbboin.phpnet.org/oppo/extractor/extractKinja.php?article=

For example: http://jbboin.phpnet.org/oppo/extractor/extractKinja.php?article=https://oppositelock.kinja.com/a-general-handbook-for-posting-on-oppositelock-1293992803
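For anyone who prefers to build these links from a script, here is a minimal Python sketch of the step above. The function name `extractor_url` and the input check are my own assumptions for illustration; the only thing taken from the post is that the article ID or full https:// URL is pasted, unchanged, at the end of the extractor URL.

```python
# Base URL given in the post; the article ID or URL is appended as-is.
BASE = "http://jbboin.phpnet.org/oppo/extractor/extractKinja.php?article="


def extractor_url(article: str) -> str:
    """Return the extractor link for a numeric article ID or a full article URL.

    The validation below is an assumption: the post only says to use either
    the ID or the full URL "including https://".
    """
    article = article.strip()
    if not (article.isdigit() or article.startswith("https://")):
        raise ValueError("expected a numeric article ID or a full https:// URL")
    return BASE + article


# Example inputs taken from the post itself.
for item in [
    "1845644279",
    "https://oppositelock.kinja.com/a-general-handbook-for-posting-on-oppositelock-1293992803",
]:
    print(extractor_url(item))
```

Opening each printed link in a browser should then trigger the extraction for that article, one request per article.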

How to use - To backup articles listed on a page

Copy/paste the full URL (including https://) of the page listing posts (like the main page of a blog or the page listing posts of a user) you want archived at the end of this URL: [link missing]

For example: [link missing]

You can set &update=1 if you want the script to re-fetch articles already in the archive, for example if their content has been modified.

You can set &maxReplied=X if you want the script to fetch X (between 1 and 100) articles at once.

The operation usually takes around 10 seconds per post (depending mostly on the number and size of the attached photos). If it runs longer than 2 minutes you will get a “504 Gateway timeout” error, but the script will continue to run in the background.
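As a rough illustration of the two options and the timing above, here is a small Python sketch. The helper names (`batch_options`, `estimated_seconds`, `likely_times_out`) are hypothetical, and the 10 seconds per post is only the approximate figure quoted above, so treat the result as an estimate rather than a guarantee.

```python
from typing import Optional

SECONDS_PER_POST = 10   # "around 10 seconds per post" (approximate)
GATEWAY_TIMEOUT = 120   # the 504 error shows up after 2 minutes


def batch_options(update: bool = False, max_replied: Optional[int] = None) -> str:
    """Build the optional suffix (&update=1, &maxReplied=X) to append to a batch URL."""
    parts = []
    if update:
        parts.append("update=1")
    if max_replied is not None:
        if not 1 <= max_replied <= 100:
            raise ValueError("maxReplied must be between 1 and 100")
        parts.append(f"maxReplied={max_replied}")
    return "".join("&" + part for part in parts)


def estimated_seconds(posts: int) -> int:
    """Rough duration of a batch, assuming every post takes the average time."""
    return posts * SECONDS_PER_POST


def likely_times_out(posts: int) -> bool:
    """True if the estimate exceeds the 2 minute gateway limit."""
    return estimated_seconds(posts) > GATEWAY_TIMEOUT


print(batch_options(update=True, max_replied=20))
print(estimated_seconds(20), likely_times_out(20))
```

With these figures, a batch of more than about 12 posts would be expected to end in a 504 in the browser even though the import itself keeps running in the background.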

What has already been extracted is browsable [link missing] (it’s simply a DirectoryIndex at the moment)

Known bugs at the moment

Image galleries are not working, but the images/videos are saved anyway (you can access all the files of the article by removing “article.html” from the URL)

The comments are not integrated into the post; it’s not a bug, it’s a feature (for the time being at least), but they are saved in the articleMetadatas.json

Poster avatar can be stretched in some cases: at the moment it saves the highest resolution available for this image, which might not be the one normally used by Kinja FIXED

Tweets are currently fixed in height, which crops big ones (those with video, for example); Instagram posts have the same issue

I haven’t worked on the embedded Instagram posts (as I haven’t found one), so it will still use the Kinja widget for the time being FIXED

Vimeo embedding is not working ([link missing] for example) FIXED

Links in the article to other articles are not modified, so they won’t work anymore once the Kinjapocalypse has happened

Instagram posts ([link missing]) are looking a bit... not normal

 

Source code [link missing], if anyone is interested.

PS: I initially put the wrong tool name in the post title... [name missing] instead of extractKinja, sorry for the confusion :(


DISCUSSION (19)


Kinja'd!!! Who is the Leader - 404 / Blog No Longer Available > Jb boin
11/11/2020 at 13:09

Kinja'd!!!0

I tried this out for a few of my posts and I’m impressed with how well it works. It keeps links and embeds even. I still have not had time to archive any of my posts. So if I wanted to download these to save in the cloud, how would I do that? I’m not sure if I want to do the Chrome ‘save page’ for some of the posts with lots of comments and replace the raw text version. I just am a little overwhelmed right now with other things so I don’t have much time to devote to it.


Kinja'd!!! Just Jeepin' > Jb boin
11/11/2020 at 13:12

Kinja'd!!!0

I haven’t worked on the embedded Instagram posts (as i haven’t found one) so it will still use the Kinja widget for the time being

Archduke would have a few.


Kinja'd!!! Jb boin > Who is the Leader - 404 / Blog No Longer Available
11/11/2020 at 13:19

Kinja'd!!!1

Well, it’s what this tool is for :)

The next step is to add a function that fetches a list of articles (either per blog/age or per poster) and does a loop to fetch many; it just might become a bit much at some point :)


Kinja'd!!! user314 > Jb boin
11/11/2020 at 13:25

Kinja'd!!!0

Looks pretty good. Used it to copy my Flightline post for today over to Hyphen; everything was really slick, aside from the known issues.


Kinja'd!!! Who is the Leader - 404 / Blog No Longer Available > Jb boin
11/11/2020 at 13:35

Kinja'd!!!0

My tech-foo is pretty lackluster. I could easily manually pick around 30 or 40 posts that I want to save out of my 300 or so by hand but I’m just not sure how I want to go about saving them to a hard drive or a cloud storage or both.


Kinja'd!!! Jb boin > Who is the Leader - 404 / Blog No Longer Available
11/11/2020 at 13:49

Kinja'd!!!0

I mean, the tool saves it itself, meaning that if you visit it again after the Kinjapocalypse it will still work the same.


Kinja'd!!! Who is the Leader - 404 / Blog No Longer Available > Jb boin
11/11/2020 at 13:53

Kinja'd!!!0

Yes, just wondering how I could save the .html format locally to my computer.


Kinja'd!!! Jb boin > user314
11/11/2020 at 13:53

Kinja'd!!!0

But you still had to re-insert the images one by one, right?


Kinja'd!!! user314 > Jb boin
11/11/2020 at 14:42

Kinja'd!!!0

Yeah. I also got duplicated captions when I copy/pasted the text, but I think that's a browser issue.


Kinja'd!!! duurtlang > Jb boin
11/12/2020 at 04:00

Kinja'd!!!0

Is there a way to:

Use a .txt with URLs so you don’t need to open every single one manually?

Include comments?

Download the articles in a format that is not as restricted as PDF? As in, with usable images.


Kinja'd!!! Jb boin > duurtlang
11/12/2020 at 04:57

Kinja'd!!!0

At the moment I haven’t finished the part of the script that will automatically retrieve “all” posts of a blog or user, but first I need to fix all the small bugs, like the various embeds that depend on Kinja, to avoid having to re-modify what has already been archived.

Comments are a whole different beast: the data is fetched by browsers directly as JSON, and it’s JavaScript code that generates the HTML. For example, the comments on this article as loaded by browsers look like this! I will try to fetch the JSON and maybe later try to bring them back to life, but it might not be a simple task. I should also try to fetch the images from the comments if I can.

As is, the articles will still be available online after having been archived and you can browse the files; but if you really want to save them on your computer you can either save them directly or I can send you an archive with the content.

Adding your long post made me realize that there is also a Google Maps embed (I didn’t even remember it was a thing) that I have to fix, as well as the images that have text next to them being... oversized; I will check that.


Kinja'd!!! Jb boin > Jb boin
11/12/2020 at 05:05

Kinja'd!!!0

ps: “images that have text next to them” (sorry, don’t have the right name for it) are now fixed.


Kinja'd!!! Jb boin > Who is the Leader - 404 / Blog No Longer Available
11/12/2020 at 05:14

Kinja'd!!!1

Yes, if you really want to. On Kinja, some of the HTML code is generated/modified by JavaScript, which makes the pages somewhat broken if only the HTML code is exported. That’s why the script does so many things: it tries to make the page simpler and static, which makes the export simpler.


Kinja'd!!! duurtlang > Jb boin
11/12/2020 at 05:24

Kinja'd!!!0

I’ll give you an example: https://oppositelock.kinja.com/oppomeet-europe-2019-when-and-where-we-need-suggestio-1829503303

I’d really like to be able to save such a thing, including the comments. If you see the comments, you’ll probably understand. It’s a rather helpful archive/tool for the future.

None of this is your responsibility in any way or form. And I have no idea how much work it is, as I can’t code myself. But I really appreciate the help :)


Kinja'd!!! Jb boin > duurtlang
11/12/2020 at 05:50

Kinja'd!!!0

If you want this article without the comments, it’s: http://jbboin.phpnet.org/oppo/extractor/extractKinja.php?article=https://oppositelock.kinja.com/oppomeet-europe-2019-when-and-where-we-need-suggestio-1829503303

The comments are trickier: extracting all the images from them doesn’t look too hard, but re-creating the text with formatting and images in the middle of it might be close to impossible (for me at least).

I totally get why you want them, but I am not a developer either, so I can’t work wonders :(


Kinja'd!!! BvdV - The Dutch Engineer > duurtlang
11/12/2020 at 06:33

Kinja'd!!!0

I’m working on the foolproof ;) version of that. Expect to finish it tonight. I’ll keep you posted. It takes a txt list as input, and outputs separate image files and a text article

No comments though sadly


Kinja'd!!! duurtlang > BvdV - The Dutch Engineer
11/12/2020 at 06:44

Kinja'd!!!0

I’ll take anything I can get. I have PDFs, which are convenient but shitty for pictures. There is Jb boin’s nicer solution, which requires me to click a link for each article (I used my .txt file and added his URL in front of each link), and then there’s yours. As long as there is a usable backup I’d be happy.


Kinja'd!!! BvdV - The Dutch Engineer > duurtlang
11/12/2020 at 15:01

Kinja'd!!!0

In case you want to try it: https://oppositelock.kinja.com/kinjaextractor-an-easy-way-to-back-up-your-posts-v2-1845657614


Kinja'd!!! Jb boin > BvdV - The Dutch Engineer
11/12/2020 at 23:18

Kinja'd!!!1

Damn, I didn’t even realize I put the wrong name in the title of this post... what an idiot!